Exploring reliability of exascale systems through simulations
Authors
Abstract
Exascale computers are predicted to emerge by the end of this decade with millions of nodes and billions of concurrent cores/threads. One of the most critical challenges for exascale computing is how to maintain system reliability effectively and efficiently. Checkpointing is the state-of-the-art technique for reliability in high-end computing systems and has proved to work well at the current petascale. This paper investigates the suitability of the checkpointing mechanism for exascale computers, across both parallel filesystems and distributed filesystems. We built a model to emulate exascale systems and developed a simulator, RXSim, to study their reliability and efficiency. Experiments show that overall system efficiency and availability would go towards zero as systems approach exascale when checkpointing to parallel filesystems. However, the simulations suggest that a distributed filesystem with local persistent storage would offer excellent scalability and aggregate bandwidth, enabling efficient checkpointing at exascale.
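The scaling argument in the abstract can be illustrated with a back-of-the-envelope model. The sketch below is not RXSim and does not reproduce the paper's model; it uses Young's first-order approximation for the optimal checkpoint interval, and every parameter value (node MTBF, memory per node, bandwidths) is an assumption chosen only for illustration. It also ignores restart time and the loss of a failed node's local checkpoint.

```python
import math

# All values below are illustrative assumptions, not figures from the paper.
NODE_MTBF_S = 5 * 365 * 24 * 3600   # assumed per-node MTBF: 5 years, in seconds
MEM_PER_NODE_GB = 16.0              # assumed checkpoint size per node
SHARED_FS_GBPS = 1000.0             # assumed fixed aggregate parallel-filesystem bandwidth
LOCAL_DISK_GBPS = 4.0               # assumed per-node local storage bandwidth


def system_mtbf(nodes: int) -> float:
    """System MTBF shrinks linearly with node count (independent failures)."""
    return NODE_MTBF_S / nodes


def efficiency(checkpoint_s: float, mtbf_s: float) -> float:
    """Fraction of time spent on useful work under Young's approximation.

    With the interval tau = sqrt(2 * delta * M), the overhead
    delta / tau + tau / (2 * M) simplifies to sqrt(2 * delta / M).
    The approximation only holds when delta is much smaller than M;
    the result is clamped to 0 once overhead exceeds the whole run time.
    """
    waste = math.sqrt(2.0 * checkpoint_s / mtbf_s)
    return max(0.0, 1.0 - waste)


def shared_fs_checkpoint_s(nodes: int) -> float:
    """Checkpoint time when all nodes share one fixed-bandwidth filesystem."""
    return nodes * MEM_PER_NODE_GB / SHARED_FS_GBPS


def local_checkpoint_s() -> float:
    """Checkpoint time when each node writes to its own local storage."""
    return MEM_PER_NODE_GB / LOCAL_DISK_GBPS


if __name__ == "__main__":
    for nodes in (1_000, 10_000, 100_000, 1_000_000):
        m = system_mtbf(nodes)
        shared = efficiency(shared_fs_checkpoint_s(nodes), m)
        local = efficiency(local_checkpoint_s(), m)
        print(f"{nodes:>9} nodes  shared-FS: {shared:.3f}  local: {local:.3f}")
```

Under these assumptions the shared-filesystem checkpoint time grows with node count while the system MTBF shrinks, so efficiency collapses; with per-node local storage the checkpoint time stays constant and only the MTBF term degrades.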
Similar resources
Mero: Co-Designing an Object Store for Extreme Scale
Within the HPC community, there is consensus that Exascale computing will be plagued with issues related to data I/O performance and data storage infrastructure reliability, caused primarily by the growing gap between compute and storage performance, and the ever increasing volumes of data generated by scientific simulations, instruments and sensors. The architectural assumptions for extreme co...
Fast Exploration of Silicon Photonic Network Designs for Exascale Systems
An approach for exploring the potential applications and performance-energy benefits of silicon photonic technology in future computing systems with a particular focus on Exascale design considerations is presented.
Performance Impacts with Reliable Parallel File Systems at Exascale Level
The introduction of Exascale storage into production systems will lead to an increase in the number of storage servers needed by parallel file systems. In this scenario, parallel file system designers should move from the current replication configurations to the more space- and energy-efficient erasure-coded configurations between storage servers. Unfortunately, the current trends on energy eff...
Programming Environments for Exascale
The Department of Energy’s exascale software stack program (X-Stack) is exploring novel ideas for programming exascale machines in the 2023 time frame. Driven by power constraints and diminishing returns on traditional uniprocessor scaling, the architectural landscape is undergoing a fairly radical transformation relative to the networks of single core and simple multicore systems of the past de...
SimHEC: Understanding Application Efficiency at Exascales through Simulations
HEC systems are expected to enter the exascale era within a decade, reaching one thousand times the performance of today’s petascale systems. Meanwhile, many challenges have been pointed out: if HEC systems grow without indispensable improvements to today’s architecture, they could collapse at exascale, because the functionality wou...
Journal title:
Volume, issue
Pages -
Publication date: 2013